(ACMEGRADE INTERNSHIP)

Name : Dhanshree Bhimsing Rajput

Batch: Data Science(Self paced) December'23

Project 1 Name : Detection of Parkinson's Disease

[Image: Parkinson's disease symptoms infographic]

Summary :

This notebook builds and evaluates a Gradient Boosting Classifier model that predicts Parkinson's disease from the voice-measurement features provided in the dataset. Here's a summary of what's happening in the code:

  1. Data Exploration: Initially, the code visualizes the distribution of numeric features using Seaborn's histplot function. This helps in understanding the skewness and distribution of each numeric feature.

  2. Correlation Analysis: Next, the code calculates the correlation matrix among the features and visualizes it as a heatmap using Seaborn's heatmap function. This helps in identifying correlations between different features, which can provide insights into potential relationships within the data.

  3. Data Preprocessing: The 'name' column is dropped from the dataset, assuming it's not contributing to the predictive task. Then, the dataset is split into features (X) and the target variable (Y).

  4. Model Training and Evaluation: The dataset is split into training and testing sets using the train_test_split function from scikit-learn. A Gradient Boosting Classifier model is trained on the training data and evaluated on both the training and testing sets.

  5. Evaluation Metrics: Various evaluation metrics such as accuracy, confusion matrix, recall, classification report, and Cohen's Kappa score are calculated and printed to assess the performance of the model on both the training and testing sets.

  6. Pickling: Finally, the trained model is saved to a pickle file so that it can be reused later without having to retrain it every time.

Parkinson's Disease: Parkinson's disease is a neurodegenerative disorder that primarily affects movement. It is characterized by symptoms such as tremors, stiffness, slow movements, and impaired balance. Early diagnosis and treatment can help manage symptoms and improve the quality of life for individuals with Parkinson's disease. Machine learning models like the one built in this code can potentially assist in diagnosing Parkinson's disease based on relevant features extracted from patient data.

Install the following Libraries
Import Libraries
Check Current Directory
Change the directory
Read Data, display records
Attribute Information: Target column - Status

Matrix column entries (attributes):

name - ASCII subject name and recording number

MDVP:Fo(Hz) - Average vocal fundamental frequency

MDVP:Fhi(Hz) - Maximum vocal fundamental frequency

MDVP:Flo(Hz) - Minimum vocal fundamental frequency

MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency

MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude

NHR,HNR - Two measures of ratio of noise to tonal components in the voice

status - Health status of the subject: (one) - Parkinson's, (zero) - healthy

RPDE,D2 - Two nonlinear dynamical complexity measures

DFA - Signal fractal scaling exponent

spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

Pandas Profiling Report
Display the shape
Number of rows
Display the data type of all columns
Display Details
Describe the details
Check for Null Values
Display column details
Display the dependent variable
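The inspection steps listed above (shape, row count, data types, details, null values, dependent variable) can be sketched with standard pandas calls. The original cells are not shown here, so this is only a minimal sketch on a small synthetic DataFrame; the values are made up and only the column names follow the attribute list.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Parkinson's dataset (hypothetical values, not the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "name": [f"subject_{i}" for i in range(6)],
    "MDVP:Fo(Hz)": rng.normal(150, 30, 6),
    "NHR": rng.random(6) * 0.1,
    "status": [1, 1, 0, 1, 0, 1],
})

print(df.shape)                      # (rows, columns)
print(len(df))                       # number of rows
print(df.dtypes)                     # data type of every column
df.info()                            # column details: non-null counts and dtypes
print(df.describe())                 # summary statistics of the numeric columns
print(df.isnull().sum())             # number of null values per column
print(df["status"].value_counts())   # class balance of the dependent variable
```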
Create Histogram with Status column
Create Bar graph - X-Axis Status, Y-Axis NHR

Patients affected by Parkinson's disease show a higher NHR, the measure of the ratio of noise to tonal components in the voice, than healthy subjects.

Create Bar graph - X-Axis Status, Y-Axis RPDE

The nonlinear dynamical complexity measure RPDE is also higher in patients affected by Parkinson's disease than in healthy subjects.
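A bar graph of a feature against status typically shows the mean of that feature per group (Seaborn's barplot uses the mean as its default estimator). The numbers behind such a comparison can be checked directly with a group-by. The values below are toy numbers chosen for illustration, not measurements from the dataset.

```python
import pandas as pd

# Toy values for illustration only (not taken from the real dataset).
df = pd.DataFrame({
    "status": [0, 0, 1, 1, 1],
    "NHR":    [0.01, 0.02, 0.03, 0.05, 0.04],
    "RPDE":   [0.40, 0.44, 0.50, 0.56, 0.53],
})

# Mean of each feature per health status; these are the bar heights a mean-based bar graph would draw.
group_means = df.groupby("status")[["NHR", "RPDE"]].mean()
print(group_means)
```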

Create Distribution plot – This is used to check skewness in the data

In this modification, I've used Seaborn's histplot function, displayed the correlation matrix as a heatmap, and adjusted how the dataset is split into train and test sets.
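Skewness can also be read off numerically, which complements the visual check. The sketch below uses pandas' `DataFrame.skew()` on two synthetic columns (the column names and distributions are assumptions for illustration): a value near zero suggests a roughly symmetric distribution, while a clearly positive value indicates a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic columns for illustration: one symmetric, one with a long right tail.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "symmetric": rng.normal(0, 1, 500),
    "right_skewed": rng.exponential(1.0, 500),
})

# Sample skewness of every numeric column.
skewness = df.skew()
print(skewness)
```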

Display Correlation Matrix
Display Heatmap
Heatmap with Default Parameters
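The correlation matrix that the heatmap visualizes can be computed with pandas' `DataFrame.corr()`, which uses the Pearson coefficient by default. This is a minimal sketch on synthetic columns (`a`, `b`, `c` are hypothetical names), not the notebook's original cell.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is built from "a", "c" is independent noise.
rng = np.random.default_rng(1)
base = rng.normal(size=200)
df = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

Passing a matrix like `corr` to Seaborn's heatmap function draws it as a colored grid, which makes strongly correlated feature pairs easy to spot.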
Splitting the dataset into x and y
Create X
Create Y
Splitting the data into x_train, y_train, x_test, y_test
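The preprocessing and split can be sketched as follows. The data is synthetic, and the `test_size`, `random_state`, and `stratify` settings are assumptions for illustration; the notebook's actual values are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same kind of columns (values are made up).
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "name": [f"subject_{i}" for i in range(n)],
    "NHR": rng.random(n),
    "RPDE": rng.random(n),
    "status": np.array([1] * 75 + [0] * 25),
})

data = df.drop(columns=["name"])      # identifier column carries no predictive signal
X = data.drop(columns=["status"])     # features
y = data["status"]                    # target

# Assumed settings: 80/20 split, fixed seed, class ratio preserved in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(x_train.shape, x_test.shape)
```

Stratifying on `y` keeps the proportion of Parkinson's and healthy subjects the same in the training and testing sets, which matters when the classes are imbalanced.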
Random Forest Regression

This code trains a Random Forest regressor using RandomForestRegressor from sklearn.ensemble. Because the model is a regressor, there is no confusion matrix or accuracy score; mean squared error and the R2 score are used as evaluation metrics instead.
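A minimal sketch of this step, assuming a synthetic 0/1 target and arbitrary hyperparameters (`n_estimators=100`, `random_state=42`), might look like this. It also shows why accuracy does not apply directly: the regressor outputs continuous values between 0 and 1 rather than class labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and a 0/1 target (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters are assumptions, not the notebook's settings.
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(x_train, y_train)
pred = reg.predict(x_test)   # continuous values, not class labels

mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MSE: {mse:.3f}, R2: {r2:.3f}")
```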

Create a Gradient Boosting Classifier Model
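Training and evaluating the classifier on both the training and testing sets can be sketched as below. The data is synthetic and the model uses scikit-learn's default hyperparameters with an assumed `random_state`; the notebook's actual settings are not shown.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = GradientBoostingClassifier(random_state=42)
model.fit(x_train, y_train)

# Report the same metrics on the training set and on the testing set.
for label, features, target in [("train", x_train, y_train), ("test", x_test, y_test)]:
    pred = model.predict(features)
    print(label, "accuracy:", accuracy_score(target, pred))
    print(confusion_matrix(target, pred))
    print("recall:", recall_score(target, pred))
    print(classification_report(target, pred))

test_pred = model.predict(x_test)
test_cm = confusion_matrix(y_test, test_pred)
```

A large gap between training and testing accuracy is a sign of overfitting, which is why both sets are evaluated.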

Wrong Predictions made.
Kappa Score
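Cohen's Kappa measures agreement between actual and predicted labels after correcting for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement. The small worked example below, with hand-made labels, counts the wrong predictions and checks a manual computation against scikit-learn's `cohen_kappa_score`.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hand-made labels for illustration (not model output from the notebook).
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

# Number of wrong predictions.
wrong = int((y_true != y_pred).sum())

# Observed agreement and agreement expected by chance.
p_o = (y_true == y_pred).mean()
p_e = sum((y_true == c).mean() * (y_pred == c).mean() for c in (0, 1))
kappa_manual = (p_o - p_e) / (1 - p_e)

kappa = cohen_kappa_score(y_true, y_pred)
print("wrong predictions:", wrong)
print("kappa (manual):", round(kappa_manual, 4))
print("kappa (sklearn):", round(kappa, 4))
```

Here p_o = 0.8 and p_e = 0.7 * 0.7 + 0.3 * 0.3 = 0.58, so kappa = 0.22 / 0.42, roughly 0.52.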
Display the test and Predicted Values
Transpose and display
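One way to display the test and predicted values side by side, and then transposed, is to put them in a small DataFrame. The labels and index values below are invented for illustration, and the column names `Actual` and `Predicted` are assumptions.

```python
import pandas as pd

# Invented labels; the index mimics the shuffled row numbers of a test split.
y_test = pd.Series([1, 0, 1, 1, 0], index=[12, 47, 3, 88, 150], name="status")
y_pred = [1, 0, 0, 1, 0]

comparison = pd.DataFrame(
    {"Actual": y_test.to_numpy(), "Predicted": y_pred},
    index=y_test.index,
)
print(comparison)      # one row per test sample
print(comparison.T)    # transposed: one column per test sample
```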
Create Pickle File
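Saving and reloading a trained model with Python's `pickle` module can be sketched as below. The file name `parkinsons_model.pkl` and the temporary directory are hypothetical, and the model is trained on synthetic data only so that the example is self-contained.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic training data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
model = GradientBoostingClassifier(random_state=42).fit(X, y)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "parkinsons_model.pkl")

with open(path, "wb") as f:
    pickle.dump(model, f)          # save the trained model

with open(path, "rb") as f:
    restored = pickle.load(f)      # load it back without retraining

print(np.array_equal(model.predict(X), restored.predict(X)))
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best reloaded with the same library version that created it.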
AdaBoost Classifier
Wrong Prediction and Kappa Score
Kappa Score
Gaussian Naive Bayes
Wrong Prediction and Kappa Score
Multi-layer Perceptron (MLP) classifier
Wrong Prediction and Kappa Score
Support Vector Machine
Wrong Prediction and Kappa Score
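The four alternative classifiers above can be compared on the same split with a short loop that records the number of wrong predictions and the Kappa score for each. This is a sketch on synthetic data with mostly default hyperparameters (`max_iter=1000` and `random_state=42` are assumptions), not the notebook's actual results.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic binary classification data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "GaussianNB": GaussianNB(),
    "MLP": MLPClassifier(max_iter=1000, random_state=42),
    "SVM": SVC(),
}

results = {}
for name, clf in models.items():
    pred = clf.fit(x_train, y_train).predict(x_test)
    results[name] = {
        "wrong": int((pred != y_test).sum()),        # misclassified test samples
        "kappa": cohen_kappa_score(y_test, pred),    # chance-corrected agreement
    }
    print(name, results[name])
```

On real data with features on very different scales, the MLP and SVM models often benefit from standardizing the features first.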
Create Pickle File